{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bayesian Regression\n",
    "\n",
    "Regression is one of the most common and basic supervised learning tasks in machine learning. Suppose we're given a dataset $\\mathcal{D}$ of the form\n",
    "\n",
    "$$ \\mathcal{D}  = \\{ (X_i, y_i) \\} \\qquad \\text{for}\\qquad i=1,2,...,N$$\n",
    "\n",
    "The goal of linear regression is to fit a function to the data of the form:\n",
    "\n",
    "$$ y = w X + b + \\epsilon $$\n",
    "\n",
    "where $w$ and $b$ are learnable parameters and $\\epsilon$ represents observation noise. Specifically $w$ is a matrix of weights and $b$ is a bias vector.\n",
    "\n",
    "Let's first implement linear regression in PyTorch and learn point estimates for the parameters $w$ and $b$.  Then we'll see how to incorporate uncertainty into our estimates by using Pyro to implement Bayesian linear regression."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "As always, let's begin by importing the modules we'll need."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "\n",
    "from torch.autograd import Variable\n",
    "\n",
    "import pyro\n",
    "from pyro.distributions import Normal\n",
    "from pyro.infer import SVI\n",
    "from pyro.optim import Adam"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data\n",
    "We'll generate a toy dataset with one feature and $w = 3$ and $b = 1$ as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "N = 100  # size of toy data\n",
    "p = 1    # number of features\n",
    "\n",
    "def build_linear_dataset(N, noise_std=0.1):\n",
    "    X = np.linspace(-6, 6, num=N)\n",
    "    y = 3 * X + 1 + np.random.normal(0, noise_std, size=N)\n",
    "    X, y = X.reshape((N, 1)), y.reshape((N, 1))\n",
    "    X, y = Variable(torch.Tensor(X)), Variable(torch.Tensor(y))\n",
    "    return torch.cat((X, y), 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we generate the data with a fixed observation noise $\\sigma = 0.1$.\n",
    "\n",
    "## Regression\n",
    "Now let's define our regression model in the form of a neural network. We'll use PyTorch's `nn.Module` for this.  Our input $X$ is a matrix of size $N \\times p$ and our output $y$ is a vector of size $p \\times 1$.  The function `nn.Linear(p, 1)` defines a linear transformation of the form $Xw + b$ where $w$ is the weight matrix and $b$ is the additive bias."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class RegressionModel(nn.Module):\n",
    "    def __init__(self, p):\n",
    "        super(RegressionModel, self).__init__()\n",
    "        self.linear = nn.Linear(p, 1)\n",
    "\n",
    "    def forward(self, x):\n",
    "        return self.linear(x)\n",
    "\n",
    "regression_model = RegressionModel(p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training\n",
    "We will use the mean squared error (MSE) as our loss and Adam as our optimizer. We would like to optimize the parameters of the `regression_model` neural net above. We will use a somewhat large learning rate of `0.01` and run for 500 iterations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loss_fn = torch.nn.MSELoss(size_average=False)\n",
    "optim = torch.optim.Adam(regression_model.parameters(), lr=0.01)\n",
    "num_iterations = 500\n",
    "\n",
    "def main():\n",
    "    data = build_linear_dataset(N, p)\n",
    "    x_data = data[:, :-1]\n",
    "    y_data = data[:, -1]\n",
    "    for j in range(num_iterations):\n",
    "        # run the model forward on the data\n",
    "        y_pred = regression_model(x_data)\n",
    "        # calculate the mse loss\n",
    "        loss = loss_fn(y_pred, y_data)\n",
    "        # initialize gradients to zero\n",
    "        optim.zero_grad()\n",
    "        # backpropagate\n",
    "        loss.backward()\n",
    "        # take a gradient step\n",
    "        optim.step()\n",
    "        if (j + 1) % 50 == 0:\n",
    "            print(\"[iteration %04d] loss: %.4f\" % (j + 1, loss.data[0]))\n",
    "    # Inspect learned parameters\n",
    "    print(\"Learned parameters:\")\n",
    "    for name, param in regression_model.named_parameters():\n",
    "        print(\"%s: %.3f\" % (name, param.data.numpy()))\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    main()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Sample Output**:\n",
    "```\n",
    "[iteration 0050] loss: 4405.9824\n",
    "[iteration 0100] loss: 2636.4561\n",
    "[iteration 0150] loss: 1497.4330\n",
    "[iteration 0200] loss: 811.6588\n",
    "[iteration 0250] loss: 429.7031\n",
    "[iteration 0300] loss: 234.2143\n",
    "[iteration 0350] loss: 142.6732\n",
    "[iteration 0400] loss: 103.5445\n",
    "[iteration 0450] loss: 88.2896\n",
    "[iteration 0500] loss: 82.8650\n",
    "Learned parameters:\n",
    "linear.weight: 2.996\n",
    "linear.bias: 1.094\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not too bad - you can see that the neural net learned parameters that were pretty close to the ground truth of $w = 3, b = 1$.  But how confident should we be in these point estimates?\n",
    "\n",
    "Bayesian modeling (see [here](http://mlg.eng.cam.ac.uk/zoubin/papers/NatureReprint15.pdf) for an overview) offers a systematic framework for reasoning about model uncertainty. Instead of just learning point estimates, we're going to learn a _distribution_ over values of the parameters $w$ and $b$ that are consistent with the observed data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bayesian Regression\n",
    "\n",
    "In order to make our linear regression Bayesian, we need to put priors on the parameters $w$ and $b$. These are distributions that represent our prior belief about reasonable values for $w$ and $b$ (before observing any data).\n",
    "\n",
    "### `random_module()`\n",
    "\n",
    "In order to do this, we'll 'lift' the parameters $w$ and $b$ to random variables. We can do this in Pyro via `random_module()`, which effectively takes a `nn.Module` and turns it into a distribution over neural networks. Specifically, each parameter in the original neural net is sampled from the provided prior. This allows us to repurpose vanilla neural nets for use in the Bayesian setting. For example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mu = Variable(torch.zeros(1, 1))\n",
    "sigma = Variable(torch.ones(1, 1))\n",
    "# define a unit normal prior\n",
    "prior = Normal(mu, sigma)\n",
    "# overload the parameters in the regression nn with samples from the prior\n",
    "lifted_module = pyro.random_module(\"regression_module\", regression_model, prior)\n",
    "# sample a nn from the prior\n",
    "sampled_nn = lifted_module()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model\n",
    "\n",
    "We now have all the ingredients needed to specify our model. First we define priors over $w$ and $b$.  Because we're uncertain about the parameters a priori, we'll use relatively wide priors $\\mathcal{N}(\\mu = 0, \\sigma = 10)$.  Then we wrap `regression_model` with `random_module` and sample an instance of the neural net, `lifted_nn`. We then run the neural net forward on the inputs `x_data`. Finally we use the `pyro.observe` statement to condition on the observed data `y_data`. Note that we use the same fixed observation "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def model(data):\n",
    "    # Create unit normal priors over the parameters\n",
    "    x_data = data[:, :-1]\n",
    "    y_data = data[:, -1]\n",
    "    mu, sigma = Variable(torch.zeros(p, 1)), Variable(10 * torch.ones(p, 1))\n",
    "    bias_mu, bias_sigma = Variable(torch.zeros(1)), Variable(10 * torch.ones(1))\n",
    "    w_prior, b_prior = Normal(mu, sigma), Normal(bias_mu, bias_sigma)\n",
    "    priors = {'linear.weight': w_prior, 'linear.bias': b_prior}\n",
    "    # lift module parameters to random variables\n",
    "    lifted_module = pyro.random_module(\"module\", regression_model, priors)\n",
    "    # sample a nn (which also samples w and b)\n",
    "    lifted_nn = lifted_module()\n",
    "    # run the nn forward\n",
    "    latent = lifted_nn(x_data).squeeze()\n",
    "    # condition on the observed data\n",
    "    pyro.observe(\"obs\", Normal(latent, Variable(0.1 * torch.ones(data.size(0)))),\n",
    "                 y_data.squeeze())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Guide\n",
    "\n",
    "In order to do inference we're going to need a guide, i.e. a parameterized family of distributions over $w$ and $b$. Writing down a guide will proceed in close analogy to the construction of our model, with the key difference being that the guide parameters need to be trainable. To do this we register the guide parameters in the ParamStore using `pyro.param()` and make sure each PyTorch `Variable` has the flag `requires_grad` set to `True`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "softplus = torch.nn.Softplus()\n",
    "\n",
    "def guide(data):\n",
    "    # define our variational parameters\n",
    "    w_mu = Variable(torch.randn(p, 1), requires_grad=True)\n",
    "    # note that we initialize our sigmas to be pretty narrow\n",
    "    w_log_sig = Variable(-3.0 * torch.ones(p, 1) + 0.05 * torch.randn(p, 1), \n",
    "                         requires_grad=True)\n",
    "    b_mu = Variable(torch.randn(1), requires_grad=True)\n",
    "    b_log_sig = Variable(-3.0 * torch.ones(1) + 0.05 * torch.randn(1), \n",
    "                         requires_grad=True)\n",
    "    # register learnable params in the param store\n",
    "    mw_param = pyro.param(\"guide_mean_weight\", w_mu)\n",
    "    sw_param = softplus(pyro.param(\"guide_log_sigma_weight\", w_log_sig))\n",
    "    mb_param = pyro.param(\"guide_mean_bias\", b_mu)\n",
    "    sb_param = softplus(pyro.param(\"guide_log_sigma_bias\", b_log_sig))\n",
    "    # guide distributions for w and b\n",
    "    w_dist, b_dist = Normal(mw_param, sw_param), Normal(mb_param, sb_param)\n",
    "    dists = {'linear.weight': w_dist, 'linear.bias': b_dist}\n",
    "    # overload the parameters in the module with random samples \n",
    "    # from the guide distributions\n",
    "    lifted_module = pyro.random_module(\"module\", regression_model, dists)\n",
    "    # sample a nn (which also samples w and b)\n",
    "    return lifted_module()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we choose Gaussians for both guide distributions. Also, to ensure positivity, we pass each log sigma through a `softplus()` transformation.\n",
    "\n",
    "## Inference\n",
    "\n",
    "To do inference we'll use stochastic variational inference (SVI) (for an introduction to SVI, see [SVI Part I](svi_part_i)). Just like in the non-Bayesian linear regression, each iteration in our training loop will take a gradient step, with the difference being that in this case, we'll use the ELBO objective instead of the MSE loss.\n",
    "\n",
    "The Pyro backend will construct the ELBO objective function for us; this logic is handled by the `SVI` class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "optim = Adam({\"lr\": 0.01})\n",
    "svi = SVI(model, guide, optim, loss=\"ELBO\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here `Adam` is a thin wrapper around `torch.optim.Adam` (see [here](svi_part_i#Optimizers) for a discussion). The complete training loop is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def main():\n",
    "    pyro.clear_param_store()\n",
    "    data = build_linear_dataset(N, p)\n",
    "    for j in range(num_iterations):\n",
    "        # calculate the loss and take a gradient step\n",
    "        loss = svi.step(data)\n",
    "        if j % 100 == 0:\n",
    "            print(\"[iteration %04d] loss: %.4f\" % (j + 1, loss / float(N)))\n",
    "            \n",
    "if __name__ == '__main__':\n",
    "    main()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To take an ELBO gradient step we simply call the `step` method of `SVI`. Notice that the `data` argument we pass to `step` will be passed to both `model()` and `guide()`.\n",
    "\n",
    "## Validating Results\n",
    "Let's compare the variational parameters we learned to our previous result:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for name in pyro.get_param_store().get_all_param_names():\n",
    "    print(\"[%s]: %.3f\" % (name, pyro.param(name).data.numpy()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Sample Output**:\n",
    "```\n",
    "[guide_log_sigma_weight]: -3.217\n",
    "[guide_log_sigma_bias]: -3.164\n",
    "[guide_mean_weight]: 2.966\n",
    "[guide_mean_bias]: 0.941\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the means of our parameter estimates are pretty close to the values we previously learned. Now, however, instead of just point estimates, the parameters `guide_log_sigma_weight` and `guide_log_sigma_bias` provide us with uncertainty estimates.  (Note that the sigmas are in log-space here, so the more negative the value, the narrower the width)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let's evaluate our model by checking its predictive accuracy on new test data. This is known as _point evaluation_.  We'll sample 20 neural nets from our posterior and run them on the new test data, then average across their predictions and calculate the MSE of the predicted values compared to the ground truth."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = np.linspace(6, 7, num=20)\n",
    "y = 3 * X + 1\n",
    "X, y = X.reshape((20, 1)), y.reshape((20, 1))\n",
    "x_data, y_data = Variable(torch.Tensor(X)), Variable(torch.Tensor(y))\n",
    "loss = nn.MSELoss()\n",
    "y_preds = Variable(torch.zeros(20, 1))\n",
    "for i in range(20):\n",
    "    # guide does not require the data\n",
    "    sampled_nn = guide(None)\n",
    "    # run the nn and add to total\n",
    "    y_preds = y_preds + sampled_nn(x_data)\n",
    "# take the average of the predictions\n",
    "y_preds = y_preds / 20\n",
    "print \"Loss: \", loss(y_preds, y_data).data[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Sample Output**:\n",
    "```\n",
    "Loss:  0.0310659464449\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "See the full code on [Github](https://github.com/uber/pyro/blob/dev/examples/bayesian_regression.py)."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
